LAB4¶

     Project done by Michela Pirozzi MAT:732531 and Sara Ferioli MAT:733105

Goal¶

  • Given the features in the dataset, which combination of features is most favorable for obtaining a new contract? Was that combination the same before COVID?

  • Predict the number of contracts. Is the number of contracts forecast from the data up to 2019 the same as, or similar to, the actual figures observed once COVID occurred?

Import data¶

In [12]:
# Load dataset
import pandas as pd

attivati = pd.read_csv("Rapporti_di_lavoro_attivati.csv")
attivati.head()
Out[12]:
  | DATA | GENERE | ETA | SETTOREECONOMICODETTAGLIO | TITOLOSTUDIO | CONTRATTO | MODALITALAVORO | PROVINCIAIMPRESA | ITALIANO
0 | 09/05/2020 | F | 60 | Attività di famiglie e convivenze come datori ... | NESSUN TITOLO DI STUDIO | LAVORO DOMESTICO | TEMPO PIENO | BERGAMO | UCRAINA
1 | 12/07/2019 | M | 43 | Gestioni di funicolari, ski-lift e seggiovie s... | LICENZA MEDIA | LAVORO A TEMPO DETERMINATO | TEMPO PIENO | BERGAMO | ITALIA
2 | 05/06/2013 | F | 20 | Fabbricazione di altre apparecchiature elettri... | LICENZA MEDIA | APPRENDISTATO PROFESSIONALIZZANTE O CONTRATTO ... | TEMPO PIENO | BERGAMO | ITALIA
3 | 12/03/2010 | F | 28 | Alberghi | DIPLOMA DI ISTRUZIONE SECONDARIA SUPERIORE CH... | LAVORO INTERMITTENTE A TEMPO DETERMINATO | NON DEFINITO | BERGAMO | ITALIA
4 | 06/04/2021 | F | 49 | Rifugi di montagna | LICENZA MEDIA | LAVORO INTERMITTENTE | NON DEFINITO | BERGAMO | ITALIA

Data analysis¶

Contracts during the years¶

In [14]:
# Show the graph
time_for_column(dati,"CONTRATTO","DATA")

Which sectors were most affected by COVID?¶

In [16]:
# Show the graph
px.bar(df_merge_col, 'SETTOREECONOMICODETTAGLIO', 'count', 
       color='SETTOREECONOMICODETTAGLIO', animation_frame='DATA',
       category_orders={'DATA':['2018', '2019', '2020', '2021']}, title='', range_y=[0, 60000])

Results¶

The chart shows that domestic work ('Attività di famiglie e convivenze come datori di lavoro per personale domestico') increased during COVID, whereas the hotel ('Alberghi') and restaurant ('Ristorazione con somministrazione') sectors decreased.

Geographical representation¶

In [18]:
# Show the graph
map_fig(dati,"CONTRATTO")

How has COVID affected activated contracts by age and gender?¶

In [21]:
fig.show()

Comparison between extended and terminated contracts during COVID¶

In [24]:
# Show the graph
line_fig(prorogati_merge_col,cessati_merge_col,"DATA","prorogati","cessati")

Which sectors had the most or the fewest activated, extended or terminated contracts during COVID?¶

In [26]:
print(f'The sector with the max number of activated contracts in 2020 is "{idmax_a20}" with {max_a20} contracts')
print(f'The sector with the min number of activated contracts in 2020 is "{idmin_a20}" with {min_a20} contracts')

print(f'The sector with the max number of activated contracts in 2021 is "{idmax_a21}" with {max_a21} contracts')
print(f'The sector with the min number of activated contracts in 2021 is "{idmin_a21}" with {min_a21} contracts')
The sector with the max number of activated contracts in 2020 is "Attività di famiglie e convivenze come datori di lavoro per personale domestico" with 49228 contracts
The sector with the min number of activated contracts in 2020 is "Trasporto mediante condotte di liquidi" with 1 contracts
The sector with the max number of activated contracts in 2021 is "Attività di produzione cinematografica, di video e di programmi televisivi" with 33020 contracts
The sector with the min number of activated contracts in 2021 is "Trasporto mediante condotte di liquidi" with 1 contracts
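
The variables `idmax_a20`, `max_a20` and the others are computed in cells that are not shown here. The following is only a minimal sketch of one way such values could be obtained, assuming a DataFrame with a `DATA` column in day/month/year format and a `SETTOREECONOMICODETTAGLIO` column. The name `activated_demo`, the helper `sector_extremes` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows, invented for illustration only
activated_demo = pd.DataFrame({
    "DATA": pd.to_datetime(
        ["09/05/2020", "12/07/2020", "05/06/2020", "06/04/2021"],
        format="%d/%m/%Y",
    ),
    "SETTOREECONOMICODETTAGLIO": [
        "Alberghi", "Alberghi", "Rifugi di montagna", "Alberghi",
    ],
})

def sector_extremes(df, year):
    """Return (max sector, max count, min sector, min count) for one year."""
    in_year = df["DATA"].dt.year == year
    counts = df.loc[in_year, "SETTOREECONOMICODETTAGLIO"].value_counts()
    return counts.idxmax(), int(counts.max()), counts.idxmin(), int(counts.min())

idmax_a20, max_a20, idmin_a20, min_a20 = sector_extremes(activated_demo, 2020)
print(f'max: "{idmax_a20}" with {max_a20} contracts')
print(f'min: "{idmin_a20}" with {min_a20} contracts')
```

A sector with no contracts at all in a given year does not appear in `value_counts()`, so the minimum reported this way is always at least 1.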

Transformations¶

Adding Ateco¶

In [28]:
join.head()
Out[28]:
  | GENERE | ETA | TITOLOSTUDIO | CONTRATTO | MODALITALAVORO | PROVINCIAIMPRESA | ITALIANO | mese-anno | ANNO | Codice_ateco | SETTOREECONOMICODETTAGLIO_y
0 | F | 60 | NESSUN TITOLO DI STUDIO | LAVORO DOMESTICO | TEMPO PIENO | BERGAMO | UCRAINA | 2020-05 | 2020 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ...
1 | F | 33 | NESSUN TITOLO DI STUDIO | LAVORO DOMESTICO A TEMPO DETERMINATO | TEMPO PARZIALE ORIZZONTALE | BERGAMO | HONDURAS | 2012-07 | 2012 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ...
2 | F | 45 | NESSUN TITOLO DI STUDIO | LAVORO DOMESTICO | TEMPO PIENO | BERGAMO | ITALIA | 2019-04 | 2019 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ...
3 | F | 61 | NESSUN TITOLO DI STUDIO | LAVORO DOMESTICO | TEMPO PIENO | LECCO | UCRAINA | 2014-09 | 2014 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ...
4 | F | 20 | NESSUN TITOLO DI STUDIO | LAVORO DOMESTICO | TEMPO PARZIALE ORIZZONTALE | LECCO | ITALIA | 2014-05 | 2014 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ...

Other transformations¶

In [30]:
indeterminato.head()
Out[30]:
  | GENERE | ETA | TITOLOSTUDIO | CONTRATTO | MODALITALAVORO | PROVINCIAIMPRESA | mese-anno | ANNO | Codice_ateco | SETTOREECONOMICODETTAGLIO_y | ITALIANO
0 | F | 60 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PIENO | BERGAMO | 2020-05 | 2020 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA
1 | F | 61 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PIENO | LECCO | 2014-09 | 2014 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA
2 | F | 29 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PARZIALE ORIZZONTALE | LECCO | 2017-06 | 2017 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA
3 | F | 48 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PARZIALE ORIZZONTALE | BRESCIA | 2017-08 | 2017 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA
4 | F | 54 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PARZIALE ORIZZONTALE | BRESCIA | 2020-01 | 2020 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA

Label Encoder¶

In [32]:
copia.head()
Out[32]:
  | GENERE | ETA | TITOLOSTUDIO | CONTRATTO | MODALITALAVORO | PROVINCIAIMPRESA | mese-anno | ANNO | Codice_ateco | SETTOREECONOMICODETTAGLIO_y | ITALIANO | titolostudio_transformed | modalitalavoro_transformed | provincia_transformed | nazionalita_transformed
0 | F | 60 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PIENO | BERGAMO | 2020-05 | 2020 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 4 | 0 | 57
1 | F | 61 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PIENO | LECCO | 2014-09 | 2014 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 4 | 4 | 57
2 | F | 29 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PARZIALE ORIZZONTALE | LECCO | 2017-06 | 2017 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 2 | 4 | 57
3 | F | 48 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PARZIALE ORIZZONTALE | BRESCIA | 2017-08 | 2017 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 2 | 1 | 57
4 | F | 54 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PARZIALE ORIZZONTALE | BRESCIA | 2020-01 | 2020 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 2 | 1 | 57
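
The encoding cell itself is not shown, so the following is only a minimal sketch of how a column such as `provincia_transformed` could be produced with scikit-learn's `LabelEncoder`. The DataFrame `copia_demo` and its rows are hypothetical. The codes it produces depend only on the labels present, so they will not match the notebook's values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows, invented for illustration only
copia_demo = pd.DataFrame(
    {"PROVINCIAIMPRESA": ["BERGAMO", "LECCO", "LECCO", "BRESCIA"]}
)

encoder = LabelEncoder()
# fit_transform assigns integer codes following the sorted order of the labels
copia_demo["provincia_transformed"] = encoder.fit_transform(
    copia_demo["PROVINCIAIMPRESA"]
)

# classes_ keeps the original labels, so the codes can be decoded later
decoded = encoder.inverse_transform(copia_demo["provincia_transformed"])
print(copia_demo)
print(list(encoder.classes_))
```

`LabelEncoder` imposes an arbitrary alphabetical order on the categories. A decision tree can split on such codes, but the resulting thresholds have no natural meaning for nominal features such as province or nationality.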

Encoding Manually Ordinal Categorical Features¶

In [34]:
copia.head()
Out[34]:
  | GENERE | ETA | TITOLOSTUDIO | CONTRATTO | MODALITALAVORO | PROVINCIAIMPRESA | mese-anno | ANNO | Codice_ateco | SETTOREECONOMICODETTAGLIO_y | ITALIANO | titolostudio_transformed | modalitalavoro_transformed | provincia_transformed | nazionalita_transformed | contratto_transformed | genere_transformed
0 | F | 60 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PIENO | BERGAMO | 2020-05 | 2020 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 4 | 0 | 57 | 0 | 1
1 | F | 61 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PIENO | LECCO | 2014-09 | 2014 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 4 | 4 | 57 | 0 | 1
2 | F | 29 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PARZIALE ORIZZONTALE | LECCO | 2017-06 | 2017 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 2 | 4 | 57 | 0 | 1
3 | F | 48 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PARZIALE ORIZZONTALE | BRESCIA | 2017-08 | 2017 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 2 | 1 | 57 | 0 | 1
4 | F | 54 | NESSUN TITOLO DI STUDIO | NON INDETERMINATO | TEMPO PARZIALE ORIZZONTALE | BRESCIA | 2020-01 | 2020 | 97 | ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI ... | UCRAINA | 8 | 2 | 1 | 57 | 0 | 1
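
The output above shows `NON INDETERMINATO` encoded as 0 and `F` encoded as 1. A minimal sketch of a manual encoding that would give those codes, using explicit dictionaries and `Series.map`, follows. The label `INDETERMINATO` for code 1 is an assumption (only `NON INDETERMINATO` appears in the rows shown), and `demo` is a hypothetical toy DataFrame.

```python
import pandas as pd

# Toy rows, invented for illustration only
demo = pd.DataFrame({
    "CONTRATTO": ["NON INDETERMINATO", "INDETERMINATO", "NON INDETERMINATO"],
    "GENERE": ["F", "M", "F"],
})

# Explicit code tables; the "INDETERMINATO" label is assumed
contratto_codes = {"NON INDETERMINATO": 0, "INDETERMINATO": 1}
genere_codes = {"M": 0, "F": 1}

demo["contratto_transformed"] = demo["CONTRATTO"].map(contratto_codes)
demo["genere_transformed"] = demo["GENERE"].map(genere_codes)

# Labels missing from a dictionary become NaN, which makes typos easy to spot
unmapped = demo[["contratto_transformed", "genere_transformed"]].isna().sum().sum()
print(demo)
print("unmapped values:", unmapped)
```

Writing the mapping by hand keeps the meaning of each code visible, which matters later when a prediction input such as `[[23, 2021, 1, 27, 1]]` has to be built manually.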

Remove outliers¶

In [36]:
transformed.head()
Out[36]:
ETA mese-anno ANNO Codice_ateco titolostudio_transformed modalitalavoro_transformed provincia_transformed nazionalita_transformed contratto_transformed genere_transformed
0 60 2020-05 2020 97 8 4 0 57 0 1
1 61 2014-09 2014 97 8 4 4 57 0 1
2 29 2017-06 2017 97 8 2 4 57 0 1
3 48 2017-08 2017 97 8 2 1 57 0 1
4 54 2020-01 2020 97 8 2 1 57 0 1
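
The notebook does not show which rule was used to remove outliers. Purely as an illustration, here is a sketch of one common choice, the interquartile-range (IQR) rule applied to a numeric column such as `ETA`. The helper `drop_iqr_outliers`, the factor `k=1.5` and the toy ages are assumptions, not the method actually used.

```python
import pandas as pd

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Toy ages, invented for illustration; 120 is an implausible value
ages = pd.DataFrame({"ETA": [20, 28, 43, 49, 60, 120]})
cleaned = drop_iqr_outliers(ages, "ETA")
print(cleaned["ETA"].tolist())
```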

Balancing¶

Random Undersampling¶

In [37]:
# Before balancing
# Check if the dataset is balanced
transformed["contratto_transformed"].value_counts(normalize=True)
Out[37]:
0    0.855403
1    0.144597
Name: contratto_transformed, dtype: float64
In [39]:
# Check if the dataset is balanced
balanced["contratto_transformed"].value_counts(normalize=True)
Out[39]:
0    0.5
1    0.5
Name: contratto_transformed, dtype: float64
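
The cell that builds `balanced` is not shown. A minimal sketch of random undersampling with pandas, which downsamples every class to the size of the smallest one, could look like the following. The helper `random_undersample` and the toy DataFrame are hypothetical.

```python
import pandas as pd

def random_undersample(df, target, random_state=0):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    return pd.concat(parts)

# Toy data, invented for illustration: six rows of class 0, two of class 1
transformed_demo = pd.DataFrame({
    "ETA": [60, 61, 29, 48, 54, 33, 45, 20],
    "contratto_transformed": [0, 0, 0, 0, 0, 0, 1, 1],
})

balanced_demo = random_undersample(transformed_demo, "contratto_transformed")
print(balanced_demo["contratto_transformed"].value_counts(normalize=True))
```

Undersampling discards majority-class rows. With the roughly 86/14 split shown above, most of the majority class is dropped in order to reach a 50/50 balance.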

Prediction¶

Given the features in the dataset, which combination of features is most favorable for obtaining a new contract? Was that combination the same before COVID?¶

Decision Tree¶

In [41]:
# Find the max depth necessary for the decision tree
find_best_max_depth(X_train,y_train,X_test,y_test)
In [ ]:
# Show Decision tree
plt.figure(figsize=(4,4), dpi=1000)
plot_tree(dct,
         feature_names=["ETA","ANNO","contratto_transformed","nazionalita_transformed","genere_transformed"],
          filled=True,)

plt.show()
In [44]:
# Probabilistic prediction of the values
y_pred_prob = dct.predict_proba(X_test)
print(y_pred_prob)
[[0.01872659 0.00374532 0.         ... 0.01123596 0.00374532 0.        ]
 [0.00722022 0.         0.         ... 0.00120337 0.         0.        ]
 [0.01372549 0.         0.         ... 0.01470588 0.00588235 0.        ]
 ...
 [0.00215517 0.         0.         ... 0.01508621 0.         0.        ]
 [0.00923788 0.         0.         ... 0.00923788 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
In [45]:
# Predict the values
y_pred = dct.predict(X_test)
print(y_pred)
[53 62 85 ... 85 41 41]

Confusion matrix¶

In [47]:
# Print the confusion matrix
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
[[ 936    0    0 ...   17  385    0]
 [   8    0    0 ...    0    0    0]
 [   2    0    2 ...    0    1    0]
 ...
 [  63    0    0 ... 2511  132    0]
 [ 594    0    1 ...  814 4807   12]
 [   5    0    0 ...    2   21    0]]
In [49]:
# Show the confusion matrix
fig=px.imshow(normalized,color_continuous_scale='blues')
fig.show()
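
The array `normalized` passed to `px.imshow` is computed in a cell that is not shown. A minimal sketch of one plausible normalization, dividing each row by the number of true samples of that class, follows. The helper `normalize_rows` and the toy matrix are hypothetical.

```python
import numpy as np

def normalize_rows(confusion):
    """Divide each row by its sum so that rows show per-class recall shares."""
    confusion = np.asarray(confusion, dtype=float)
    row_sums = confusion.sum(axis=1, keepdims=True)
    # Rows with no true samples would divide by zero; leave them as zeros
    return np.divide(
        confusion, row_sums, out=np.zeros_like(confusion), where=row_sums != 0
    )

# Toy confusion matrix, invented for illustration; the last class has no samples
confusion_demo = np.array([[8, 2, 0],
                           [1, 3, 0],
                           [0, 0, 0]])
normalized_demo = normalize_rows(confusion_demo)
print(normalized_demo)
```

scikit-learn's `confusion_matrix` also accepts `normalize='true'`, which performs the same row-wise normalization directly.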

Our prediction¶

In [50]:
# Prediction with our data: 
    #23 = age
    #2021 = year in which you want the contract
    #1 = indefinite contract
    #27 = Italy 
    #1 = female
X_mie=[[23,2021,1,27,1]]
In [51]:
# Probabilistic prediction of the values
y_pred_prob_mie = dct.predict_proba(X_mie)
print(y_pred_prob_mie)
y_pred_prob_mie = pd.DataFrame(y_pred_prob_mie)
y_conv=pd.DataFrame(dct.classes_)
app= pd.concat([y_pred_prob_mie,y_conv], axis=1)
df=pd.DataFrame(app)
df= df.dropna()
df=df.loc[:, (df != 0).any(axis=0)]

if (df.iloc[: , -1:].columns == 0):
    df = df.iloc[: , :-1]
    
df 
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.01302083 0.         0.         0.00260417
  0.00520833 0.         0.         0.00260417 0.00260417 0.
  0.0078125  0.00520833 0.01822917 0.         0.         0.03125
  0.00260417 0.00520833 0.01302083 0.01041667 0.         0.00520833
  0.00260417 0.         0.         0.         0.         0.
  0.         0.00520833 0.         0.01041667 0.00260417 0.03125
  0.05989583 0.015625   0.         0.00260417 0.02864583 0.02604167
  0.0078125  0.0859375  0.00260417 0.         0.00260417 0.00260417
  0.03385417 0.0078125  0.015625   0.         0.0078125  0.0078125
  0.02083333 0.0078125  0.0078125  0.00260417 0.00260417 0.0078125
  0.         0.         0.         0.00260417 0.         0.00260417
  0.0078125  0.01822917 0.02083333 0.26041667 0.02604167 0.07552083
  0.         0.         0.         0.         0.0078125  0.
  0.04427083 0.         0.        ]]
Out[51]:
8 11 12 15 16 18 19 20 23 24 ... 69 71 72 73 74 75 76 77 82 84
0 0.013021 0.002604 0.005208 0.002604 0.002604 0.007812 0.005208 0.018229 0.03125 0.002604 ... 0.002604 0.002604 0.007812 0.018229 0.020833 0.260417 0.026042 0.075521 0.007812 0.044271

1 rows × 50 columns

In [53]:
# Sectors in which there is a possibility to get a new contract
sett_pred = []
for x in df.columns:
    el = ateco[ateco["Codice_ateco"] == str(x)]
    sett_pred.extend(el["SETTOREECONOMICODETTAGLIO"])
sett_pred
Out[53]:
['INDUSTRIA DELLE BEVANDE',
 'INDUSTRIA DEL TABACCO',
 'FABBRICAZIONE DI ARTICOLI IN PELLE E SIMILI',
 'INDUSTRIA DEL LEGNO E DEI PRODOTTI IN LEGNO E SUGHERO (ESCLUSI I MOBILI); FABBRICAZIONE DI ARTICOLI IN PAGLIA E MATERIALI DA INTRECCIO',
 'STAMPA E RIPRODUZIONE DI SUPPORTI REGISTRATI',
 'FABBRICAZIONE DI COKE E PRODOTTI DERIVANTI DALLA RAFFINAZIONE DEL PETROLIO',
 'FABBRICAZIONE DI PRODOTTI CHIMICI',
 'FABBRICAZIONE DI ALTRI PRODOTTI DELLA LAVORAZIONE DI MINERALI NON METALLIFERI',
 'METALLURGIA',
 'FABBRICAZIONE DI PRODOTTI IN METALLO (ESCLUSI MACCHINARI E ATTREZZATURE)',
 'FABBRICAZIONE DI COMPUTER E PRODOTTI DI ELETTRONICA E OTTICA; APPARECCHI ELETTROMEDICALI, APPARECCHI DI MISURAZIONE E DI OROLOGI',
 'FABBRICAZIONE DI APPARECCHIATURE ELETTRICHE ED APPARECCHIATURE PER USO DOMESTICO NON ELETTRICHE',
 'FABBRICAZIONE DI AUTOVEICOLI, RIMORCHI E SEMIRIMORCHI',
 'FABBRICAZIONE DI ALTRI MEZZI DI TRASPORTO',
 'GESTIONE DELLE RETI FOGNARIE',
 'ATTIVITÀ DI RISANAMENTO E ALTRI SERVIZI DI GESTIONE DEI RIFIUTI',
 'COSTRUZIONE DI EDIFICI',
 'INGEGNERIA CIVILE',
 'LAVORI DI COSTRUZIONE SPECIALIZZATI',
 "COMMERCIO ALL'INGROSSO E AL DETTAGLIO E RIPARAZIONE DI AUTOVEICOLI E MOTOCICLI",
 "COMMERCIO ALL'INGROSSO (ESCLUSO QUELLO DI AUTOVEICOLI E DI MOTOCICLI)",
 'COMMERCIO AL DETTAGLIO (ESCLUSO QUELLO DI AUTOVEICOLI E DI MOTOCICLI)',
 'TRASPORTO TERRESTRE E TRASPORTO MEDIANTE CONDOTTE',
 "TRASPORTO MARITTIMO E PER VIE D'ACQUA",
 'MAGAZZINAGGIO E ATTIVITÀ DI SUPPORTO AI TRASPORTI',
 'SERVIZI POSTALI E ATTIVITÀ DI CORRIERE',
 'ALLOGGIO',
 'ATTIVITÀ DEI SERVIZI DI RISTORAZIONE',
 'ATTIVITÀ EDITORIALI',
 'ATTIVITÀ DI PRODUZIONE CINEMATOGRAFICA, DI VIDEO E DI PROGRAMMI TELEVISIVI, DI REGISTRAZIONI MUSICALI E SONORE',
 'ATTIVITÀ DI PROGRAMMAZIONE E TRASMISSIONE',
 'TELECOMUNICAZIONI',
 'PRODUZIONE DI SOFTWARE, CONSULENZA INFORMATICA E ATTIVITÀ CONNESSE',
 "ATTIVITÀ DEI SERVIZI D'INFORMAZIONE E ALTRI SERVIZI INFORMATICI",
 'ATTIVITÀ DI SERVIZI FINANZIARI (ESCLUSE LE ASSICURAZIONI E I FONDI PENSIONE)',
 'ASSICURAZIONI, RIASSICURAZIONI E FONDI PENSIONE (ESCLUSE LE ASSICURAZIONI SOCIALI OBBLIGATORIE)',
 'ATTIVITÀ LEGALI E CONTABILITÀ',
 "ATTIVITÀ DEGLI STUDI DI ARCHITETTURA E D'INGEGNERIA; COLLAUDI ED ANALISI TECNICHE",
 'RICERCA SCIENTIFICA E SVILUPPO',
 'PUBBLICITÀ E RICERCHE DI MERCATO',
 'ALTRE ATTIVITÀ PROFESSIONALI, SCIENTIFICHE E TECNICHE',
 'SERVIZI VETERINARI',
 'ATTIVITÀ DI NOLEGGIO E LEASING OPERATIVO',
 "ATTIVITÀ DI SUPPORTO PER LE FUNZIONI D'UFFICIO E ALTRI SERVIZI DI SUPPORTO ALLE IMPRESE",
 'AMMINISTRAZIONE PUBBLICA E DIFESA; ASSICURAZIONE SOCIALE OBBLIGATORIA']
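
In scikit-learn, column `i` of `predict_proba` corresponds to the label `classes_[i]`, not to the integer `i` itself. The lookup above uses the column positions as ATECO codes, which is only equivalent if the class labels happen to be 0, 1, 2, and so on without gaps. A minimal sketch of pairing each non-zero probability with its class label follows. The helper `nonzero_class_probabilities` and the toy arrays are hypothetical.

```python
import numpy as np

def nonzero_class_probabilities(proba_row, classes):
    """Pair each non-zero probability with its class label, most likely first."""
    proba_row = np.asarray(proba_row)
    classes = np.asarray(classes)
    keep = np.flatnonzero(proba_row > 0)
    order = keep[np.argsort(proba_row[keep])[::-1]]
    return [(classes[i].item(), float(proba_row[i])) for i in order]

# Toy values, invented for illustration: four classes whose labels are not 0..3
classes_demo = np.array([1, 10, 11, 97])
proba_demo = np.array([0.0, 0.25, 0.7, 0.05])

pairs = nonzero_class_probabilities(proba_demo, classes_demo)
print(pairs)
```

With the fitted tree from this notebook, the call would be `nonzero_class_probabilities(dct.predict_proba(X_mie)[0], dct.classes_)`, and the returned labels could then be looked up in the `ateco` table.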

What if we were males?¶

In [54]:
# Prediction with our data: 
    #23 = age
    #2021 = year in which you want the contract
    #1 = indefinite contract
    #27 = Italy 
    #0 = male
X_male=[[23,2021,1,27,0]]
In [55]:
# Probabilistic prediction of the values
y_pred_prob_male = dct.predict_proba(X_male)
print(y_pred_prob_male)
y_pred_prob_male = pd.DataFrame(y_pred_prob_male)
y_conv=pd.DataFrame(dct.classes_)
app= pd.concat([y_pred_prob_male,y_conv], axis=1)
df=pd.DataFrame(app)
df= df.dropna()
df=df.loc[:, (df != 0).any(axis=0)]

if (df.iloc[: , -1:].columns == 0):
    df = df.iloc[: , :-1]
    
df 
[[0.00568182 0.         0.         0.         0.         0.
  0.         0.         0.01325758 0.         0.         0.00378788
  0.00189394 0.         0.00189394 0.00568182 0.00378788 0.
  0.02083333 0.00378788 0.03598485 0.0094697  0.00757576 0.06818182
  0.00568182 0.01515152 0.07386364 0.01136364 0.00568182 0.00378788
  0.00568182 0.0094697  0.00189394 0.00378788 0.         0.0094697
  0.         0.01704545 0.00568182 0.04356061 0.01515152 0.05681818
  0.02651515 0.02840909 0.         0.00189394 0.02840909 0.09280303
  0.00189394 0.05681818 0.00378788 0.00189394 0.         0.
  0.11363636 0.00568182 0.01325758 0.         0.0094697  0.00189394
  0.00378788 0.00568182 0.00568182 0.00378788 0.00568182 0.00378788
  0.         0.00568182 0.00189394 0.         0.00378788 0.01515152
  0.01325758 0.02272727 0.         0.03030303 0.00189394 0.00757576
  0.00189394 0.         0.         0.00378788 0.00189394 0.00189394
  0.00568182 0.00189394 0.        ]]
Out[55]:
0 8 11 12 14 15 16 18 19 20 ... 73 75 76 77 78 81 82 83 84 85
0 0.005682 0.013258 0.003788 0.001894 0.001894 0.005682 0.003788 0.020833 0.003788 0.035985 ... 0.022727 0.030303 0.001894 0.007576 0.001894 0.003788 0.001894 0.001894 0.005682 0.001894

1 rows × 64 columns

In [56]:
# Sectors in which there is a possibility to get a new contract
sett_pred = []
for x in df.columns:
    el = ateco[ateco["Codice_ateco"] == str(x)]
    sett_pred.extend(el["SETTOREECONOMICODETTAGLIO"])
sett_pred
Out[56]:
['INDUSTRIA DELLE BEVANDE',
 'INDUSTRIA DEL TABACCO',
 'CONFEZIONE DI ARTICOLI DI ABBIGLIAMENTO; CONFEZIONE DI ARTICOLI IN PELLE E PELLICCIA',
 'FABBRICAZIONE DI ARTICOLI IN PELLE E SIMILI',
 'INDUSTRIA DEL LEGNO E DEI PRODOTTI IN LEGNO E SUGHERO (ESCLUSI I MOBILI); FABBRICAZIONE DI ARTICOLI IN PAGLIA E MATERIALI DA INTRECCIO',
 'STAMPA E RIPRODUZIONE DI SUPPORTI REGISTRATI',
 'FABBRICAZIONE DI COKE E PRODOTTI DERIVANTI DALLA RAFFINAZIONE DEL PETROLIO',
 'FABBRICAZIONE DI PRODOTTI CHIMICI',
 'FABBRICAZIONE DI PRODOTTI FARMACEUTICI DI BASE E DI PREPARATI FARMACEUTICI',
 'FABBRICAZIONE DI ARTICOLI IN GOMMA E MATERIE PLASTICHE',
 'FABBRICAZIONE DI ALTRI PRODOTTI DELLA LAVORAZIONE DI MINERALI NON METALLIFERI',
 'METALLURGIA',
 'FABBRICAZIONE DI PRODOTTI IN METALLO (ESCLUSI MACCHINARI E ATTREZZATURE)',
 'FABBRICAZIONE DI COMPUTER E PRODOTTI DI ELETTRONICA E OTTICA; APPARECCHI ELETTROMEDICALI, APPARECCHI DI MISURAZIONE E DI OROLOGI',
 'FABBRICAZIONE DI APPARECCHIATURE ELETTRICHE ED APPARECCHIATURE PER USO DOMESTICO NON ELETTRICHE',
 'FABBRICAZIONE DI MACCHINARI ED APPARECCHIATURE NCA',
 'FABBRICAZIONE DI AUTOVEICOLI, RIMORCHI E SEMIRIMORCHI',
 'FABBRICAZIONE DI ALTRI MEZZI DI TRASPORTO',
 'FABBRICAZIONE DI MOBILI',
 'ALTRE INDUSTRIE MANIFATTURIERE',
 'RIPARAZIONE, MANUTENZIONE ED INSTALLAZIONE DI MACCHINE ED APPARECCHIATURE',
 'FORNITURA DI ENERGIA ELETTRICA, GAS, VAPORE E ARIA CONDIZIONATA',
 'GESTIONE DELLE RETI FOGNARIE',
 'ATTIVITÀ DI RACCOLTA, TRATTAMENTO E SMALTIMENTO DEI RIFIUTI; RECUPERO DEI MATERIALI',
 'ATTIVITÀ DI RISANAMENTO E ALTRI SERVIZI DI GESTIONE DEI RIFIUTI',
 'COSTRUZIONE DI EDIFICI',
 'INGEGNERIA CIVILE',
 'LAVORI DI COSTRUZIONE SPECIALIZZATI',
 "COMMERCIO ALL'INGROSSO E AL DETTAGLIO E RIPARAZIONE DI AUTOVEICOLI E MOTOCICLI",
 "COMMERCIO ALL'INGROSSO (ESCLUSO QUELLO DI AUTOVEICOLI E DI MOTOCICLI)",
 'COMMERCIO AL DETTAGLIO (ESCLUSO QUELLO DI AUTOVEICOLI E DI MOTOCICLI)',
 'TRASPORTO TERRESTRE E TRASPORTO MEDIANTE CONDOTTE',
 "TRASPORTO MARITTIMO E PER VIE D'ACQUA",
 'TRASPORTO AEREO',
 'ALLOGGIO',
 'ATTIVITÀ DEI SERVIZI DI RISTORAZIONE',
 'ATTIVITÀ EDITORIALI',
 'ATTIVITÀ DI PRODUZIONE CINEMATOGRAFICA, DI VIDEO E DI PROGRAMMI TELEVISIVI, DI REGISTRAZIONI MUSICALI E SONORE',
 'ATTIVITÀ DI PROGRAMMAZIONE E TRASMISSIONE',
 'TELECOMUNICAZIONI',
 'PRODUZIONE DI SOFTWARE, CONSULENZA INFORMATICA E ATTIVITÀ CONNESSE',
 "ATTIVITÀ DEI SERVIZI D'INFORMAZIONE E ALTRI SERVIZI INFORMATICI",
 'ATTIVITÀ DI SERVIZI FINANZIARI (ESCLUSE LE ASSICURAZIONI E I FONDI PENSIONE)',
 'ASSICURAZIONI, RIASSICURAZIONI E FONDI PENSIONE (ESCLUSE LE ASSICURAZIONI SOCIALI OBBLIGATORIE)',
 'ATTIVITÀ IMMOBILIARI',
 'ATTIVITÀ DI DIREZIONE AZIENDALE E DI CONSULENZA GESTIONALE ',
 "ATTIVITÀ DEGLI STUDI DI ARCHITETTURA E D'INGEGNERIA; COLLAUDI ED ANALISI TECNICHE",
 'RICERCA SCIENTIFICA E SVILUPPO',
 'PUBBLICITÀ E RICERCHE DI MERCATO',
 'SERVIZI VETERINARI',
 'ATTIVITÀ DI NOLEGGIO E LEASING OPERATIVO',
 'ATTIVITÀ DI RICERCA, SELEZIONE, FORNITURA DI PERSONALE ',
 'ATTIVITÀ DI SERVIZI PER EDIFICI E PAESAGGIO',
 "ATTIVITÀ DI SUPPORTO PER LE FUNZIONI D'UFFICIO E ALTRI SERVIZI DI SUPPORTO ALLE IMPRESE",
 'AMMINISTRAZIONE PUBBLICA E DIFESA; ASSICURAZIONE SOCIALE OBBLIGATORIA',
 'ISTRUZIONE']

Best Depth Tree¶

In [ ]:
plt.figure(figsize=(4,4), dpi=1000)
plot_tree(best_max_depth_tree,
         feature_names=["ETA","ANNO","contratto_transformed","nazionalita_transformed","genere_transformed"],
          filled=True,)

plt.show()
In [58]:
# Show the best parameter
print(max_depth_grid_search.best_params_)
{'max_depth': 12}
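
The definition of `max_depth_grid_search` is not shown. A minimal sketch of how such a search could be set up with scikit-learn's `GridSearchCV` follows, run here on synthetic data. The depth range, the 5-fold cross-validation and the accuracy scoring are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 16))},  # candidate depths (assumed range)
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
# The tree refitted on all of the data with the best depth
best_tree_demo = search.best_estimator_
```

Because the depth is chosen by cross-validation on the training data, the test set stays untouched until the final evaluation.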

K-Nearest Neighbors Classification¶

In [ ]:
# Print the confusion matrix
confusion = metrics.confusion_matrix(y_train, y_pred)
print(confusion)
In [ ]:
# Show the confusion matrix 
fig=px.imshow(normalized,color_continuous_scale='blues')
fig.show()
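
The cells that fit the K-nearest-neighbors model and produce `y_pred` are not shown. A minimal sketch of the whole step on synthetic data follows. The value `n_neighbors=5`, the feature scaling and the 75/25 split are assumptions. Note that the labels passed to `confusion_matrix` must come from the same split as the predictions: predictions made on `X_test` are compared with `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and target
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# KNN is distance based, so features are standardized before fitting
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)

y_pred_demo = knn.predict(X_te)
confusion_demo = confusion_matrix(y_te, y_pred_demo)
print(confusion_demo)
```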

Future Prediction¶

Predict the number of contracts. Is the number of contracts forecast from the data up to 2019 the same as, or similar to, the actual figures observed once COVID occurred?¶

ARIMA¶

In [41]:
# Show the number of contracts for each month-year (mese-anno)
f, ax1 = plt.subplots(1,1,figsize=(15,5))
bal2.plot(ax=ax1)
ax1.set_xlabel("time")
ax1.set_ylabel("Number of Contracts")
plt.grid(True)
In [43]:
print('ADF Statistic: %f' % results[0])
print('p-value: %f' % results[1])
ADF Statistic: -4.613666
p-value: 0.000122
In [45]:
fig, ax = plt.subplots(4, 1, figsize=(15, 6))
decomposed_add.observed.plot(ax = ax[0])
decomposed_add.trend.plot(ax = ax[1])
decomposed_add.seasonal.plot(ax = ax[2])
decomposed_add.resid.plot(ax = ax[3])
ax[0].set_ylabel('')
ax[1].set_ylabel('Trend')
ax[2].set_ylabel('Seasonal')
ax[3].set_ylabel('Residual')
plt.tight_layout()
plt.show()
In [47]:
plt.figure(figsize=(12,5))
ax1 = bal_diff.plot()
ax1.set_xlabel("Anno")
ax1.set_ylabel("diff")
plt.grid(True)
plt.show()
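
The series `bal_diff` is built in a cell that is not shown. Its name and the "diff" axis label suggest first-order differencing, which is the "I" step of ARIMA. A minimal sketch with pandas follows. The monthly values are invented for illustration.

```python
import pandas as pd

# Toy monthly contract counts, invented for illustration
monthly = pd.Series(
    [100, 120, 115, 130, 160],
    index=pd.period_range("2019-01", periods=5, freq="M"),
)

# First difference: y[t] - y[t-1]; the first value has no predecessor and is dropped
first_diff = monthly.diff().dropna()
print(first_diff)
```

Differencing removes a trend in the level of the series. The differenced series is the one whose autocorrelation and partial autocorrelation are inspected when choosing the ARIMA orders.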
In [49]:
print('ADF Statistic: %f' % results[0])
print('p-value: %f' % results[1])
ADF Statistic: -2.987232
p-value: 0.036104
In [51]:
print('ADF Statistic: %f' % results[0])
print("P-value of a test is: {}".format(results[1]))
ADF Statistic: -7.801078
P-value of a test is: 7.490497084599864e-12
In [52]:
# Show autocorrelation and partial autocorrelation
fig,ax = plt.subplots(2,1,figsize=(20,10))
plot_acf(bal_diff, lags=4, ax=ax[0])
plot_pacf(bal_diff, lags=4, ax=ax[1])
plt.show()
In [53]:
arima_df = arima_aic(bal2)
arima_df
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            1     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.53563D+00    |proj g|=  0.00000D+00

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    1      0      1      0     0     0   0.000D+00   9.536D+00
  F =   9.5356311497057664     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            2     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.41030D+00    |proj g|=  1.82112D-03

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    2      4      7      1     0     0   1.776D-07   9.410D+00
  F =   9.4102933631734142     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.39647D+00    |proj g|=  1.64544D-03

At iterate    5    f=  9.39646D+00    |proj g|=  3.55272D-07

At iterate   10    f=  9.39646D+00    |proj g|=  8.17124D-06

At iterate   15    f=  9.39646D+00    |proj g|=  2.62901D-05

At iterate   20    f=  9.39646D+00    |proj g|=  2.54019D-05

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     24     36      1     0     0   1.776D-07   9.396D+00
  F =   9.3964614257279546     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            4     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.38227D+00    |proj g|=  1.75220D-03

At iterate    5    f=  9.38225D+00    |proj g|=  5.32907D-07

At iterate   10    f=  9.38225D+00    |proj g|=  2.38032D-05

At iterate   15    f=  9.38225D+00    |proj g|=  3.30402D-05

At iterate   20    f=  9.38225D+00    |proj g|=  6.18172D-05
At iterate   25    f=  9.38225D+00    |proj g|=  0.00000D+00

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    4     25     34      1     0     0   0.000D+00   9.382D+00
  F =   9.3822507141105049     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            2     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.49349D+00    |proj g|=  8.28138D-04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    2      2      4      1     0     0   0.000D+00   9.493D+00
  F =   9.4934924944244568     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.40371D+00    |proj g|=  1.77138D-03

At iterate    5    f=  9.40370D+00    |proj g|=  7.10543D-07

At iterate   10    f=  9.40370D+00    |proj g|=  1.68754D-05

At iterate   15    f=  9.40370D+00    |proj g|=  8.17124D-06

At iterate   20    f=  9.40370D+00    |proj g|=  5.64881D-05

At iterate   25    f=  9.40370D+00    |proj g|=  1.77636D-07

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     25     40      1     0     0   1.776D-07   9.404D+00
  F =   9.4036971479741140     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            4     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.39168D+00    |proj g|=  1.52287D-03

At iterate    5    f=  9.39167D+00    |proj g|=  2.48690D-06
           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    4      8     12      1     0     0   0.000D+00   9.392D+00
  F =   9.3916740930360216     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            5     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.36722D+00    |proj g|=  1.61098D-03

At iterate    5    f=  9.36719D+00    |proj g|=  5.55289D-04

At iterate   10    f=  9.36719D+00    |proj g|=  7.10543D-06

At iterate   15    f=  9.36719D+00    |proj g|=  1.41398D-04

At iterate   20    f=  9.36715D+00    |proj g|=  1.10987D-03

At iterate   25    f=  9.36712D+00    |proj g|=  3.16192D-05

At iterate   30    f=  9.36712D+00    |proj g|=  3.55271D-07
           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    5     32     43      1     0     0   1.776D-07   9.367D+00
  F =   9.3671181930709899     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.39460D+00    |proj g|=  2.46381D-03

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3      4      6      1     0     0   0.000D+00   9.395D+00
  F =   9.3945834705319768     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            4     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.37144D+00    |proj g|=  2.43237D-03

At iterate    5    f=  9.37142D+00    |proj g|=  7.10543D-06

At iterate   10    f=  9.37142D+00    |proj g|=  5.32907D-07

At iterate   15    f=  9.37142D+00    |proj g|=  5.86198D-06

At iterate   20    f=  9.37142D+00    |proj g|=  3.73035D-06

At iterate   25    f=  9.37142D+00    |proj g|=  4.08562D-06

At iterate   30    f=  9.37142D+00    |proj g|=  1.33227D-05
At iterate   35    f=  9.37142D+00    |proj g|=  3.49942D-05

At iterate   40    f=  9.37142D+00    |proj g|=  5.32907D-07

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    4     41     58      1     0     0   3.553D-07   9.371D+00
  F =   9.3714200543317165     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            5     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.37136D+00    |proj g|=  2.41887D-03

At iterate    5    f=  9.37134D+00    |proj g|=  1.66800D-04

At iterate   10    f=  9.37134D+00    |proj g|=  2.30926D-06

At iterate   15    f=  9.37134D+00    |proj g|=  2.13163D-06

At iterate   20    f=  9.37134D+00    |proj g|=  3.67706D-05

At iterate   25    f=  9.37134D+00    |proj g|=  1.27898D-05

At iterate   30    f=  9.37134D+00    |proj g|=  3.55271D-07

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    5     30     40      1     0     0   3.553D-07   9.371D+00
  F =   9.3713416165056049     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
 This problem is unconstrained.
 This problem is unconstrained.
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            4     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.36681D+00    |proj g|=  2.47837D-03

At iterate    5    f=  9.36679D+00    |proj g|=  8.88179D-07

At iterate   10    f=  9.36679D+00    |proj g|=  2.59348D-05

At iterate   15    f=  9.36679D+00    |proj g|=  9.36140D-05

At iterate   20    f=  9.36679D+00    |proj g|=  1.77636D-06

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    4     23     39      1     0     0   7.105D-07   9.367D+00
  F =   9.3667912594171181     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            5     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.35674D+00    |proj g|=  3.35003D-03

At iterate    5    f=  9.35671D+00    |proj g|=  3.32179D-05

At iterate   10    f=  9.35671D+00    |proj g|=  1.77636D-07

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    5     11     14      1     0     0   0.000D+00   9.357D+00
  F =   9.3567121127836508     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            7     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.33916D+00    |proj g|=  4.07532D-03

At iterate    5    f=  9.33911D+00    |proj g|=  7.81419D-04

At iterate   10    f=  9.33910D+00    |proj g|=  1.47971D-04

At iterate   15    f=  9.33910D+00    |proj g|=  5.50671D-06

At iterate   20    f=  9.33910D+00    |proj g|=  2.48690D-06

At iterate   25    f=  9.33910D+00    |proj g|=  3.19744D-05

At iterate   30    f=  9.33910D+00    |proj g|=  1.11378D-04

At iterate   35    f=  9.33910D+00    |proj g|=  3.01981D-06

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    7     37     48      1     0     0   1.776D-06   9.339D+00
  F =   9.3391012752206422     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
 This problem is unconstrained.
Out[53]:
    p  q          aic          bic  sum_aic_bic
 0  0  0  3284.257115  3290.552104   6574.80922
 4  1  0  3271.761418  3281.203902   6552.96532
 1  0  1  3243.140917    3252.5834  6495.724317
 5  1  1  3242.871819  3255.461797  6498.333616
 6  1  2  3240.735888   3256.47336  6497.209248
 2  0  2   3240.38273  3252.972708  6493.355439
 8  2  0  3239.736714  3252.326692  6492.063406
 3  0  3  3237.494246  3253.231718  6490.725964
10  2  2  3235.741516  3254.626483  6490.367999
 7  1  3  3234.288658  3253.173625  6487.462284
 9  2  1  3233.768499  3249.505971   6483.27447
12  3  0  3232.176193  3247.913666  6480.089859
13  3  1  3230.708967  3249.593934    6480.3029
15  3  3  3228.650839  3253.830794  6482.481633
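
The candidate orders in the table above can be ranked programmatically. The following is a minimal sketch, not the notebook's own `arima_aic` helper (which is not shown here): it assumes the rows are available as plain `(p, q, aic, bic)` tuples, copied by hand from the table, and picks the order that minimises each criterion.

```python
# (p, q, aic, bic) rows copied by hand from the table above
candidates = [
    (0, 0, 3284.257115, 3290.552104),
    (1, 0, 3271.761418, 3281.203902),
    (0, 1, 3243.140917, 3252.5834),
    (1, 1, 3242.871819, 3255.461797),
    (1, 2, 3240.735888, 3256.47336),
    (0, 2, 3240.38273, 3252.972708),
    (2, 0, 3239.736714, 3252.326692),
    (0, 3, 3237.494246, 3253.231718),
    (2, 2, 3235.741516, 3254.626483),
    (1, 3, 3234.288658, 3253.173625),
    (2, 1, 3233.768499, 3249.505971),
    (3, 0, 3232.176193, 3247.913666),
    (3, 1, 3230.708967, 3249.593934),
    (3, 3, 3228.650839, 3253.830794),
]

# Lower is better for both criteria
best_by_aic = min(candidates, key=lambda row: row[2])
best_by_bic = min(candidates, key=lambda row: row[3])
best_by_sum = min(candidates, key=lambda row: row[2] + row[3])

print("Best (p, q) by AIC:      ", best_by_aic[:2])
print("Best (p, q) by BIC:      ", best_by_bic[:2])
print("Best (p, q) by AIC + BIC:", best_by_sum[:2])
```

AIC alone favours the largest model, while BIC and the summed criterion favour a smaller one, because BIC penalises extra parameters more heavily.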
In [55]:
print(results.summary())
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                              count   No. Observations:                  173
Model:             SARIMAX(3, 1, 3)x(0, 1, [], 6)   Log Likelihood               -1586.377
Date:                            Wed, 01 Jun 2022   AIC                           3186.755
Time:                                    11:04:59   BIC                           3208.539
Sample:                                         0   HQIC                          3195.597
                                            - 173                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.6172      0.135     -4.588      0.000      -0.881      -0.354
ar.L2         -0.8422      0.104     -8.060      0.000      -1.047      -0.637
ar.L3         -0.1124      0.127     -0.883      0.377      -0.362       0.137
ma.L1          0.1486      0.087      1.699      0.089      -0.023       0.320
ma.L2          0.2004      0.079      2.536      0.011       0.046       0.355
ma.L3         -0.7823      0.081     -9.642      0.000      -0.941      -0.623
sigma2      1.429e+07   1.69e-09   8.46e+15      0.000    1.43e+07    1.43e+07
===================================================================================
Ljung-Box (L1) (Q):                   0.06   Jarque-Bera (JB):                 2.06
Prob(Q):                              0.81   Prob(JB):                         0.36
Heteroskedasticity (H):               0.82   Skew:                             0.11
Prob(H) (two-sided):                  0.47   Kurtosis:                         3.50
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 3.64e+32. Standard errors may be unstable.
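
The information criteria in the summary can be reproduced from the log likelihood. This is a worked check under stated assumptions rather than the library's internal code: the parameter count is taken as 7 (three AR terms, three MA terms and `sigma2`), and the effective sample size is assumed to be 173 - 7 = 166, i.e. the observations minus those lost to the regular (d=1) and seasonal (D=1, period 6) differencing. With these assumptions the standard formulas agree with the reported values to within rounding.

```python
import math

log_likelihood = -1586.377  # "Log Likelihood" from the summary above
k_params = 7                # 3 AR + 3 MA + sigma2
n_effective = 173 - 7       # assumption: 1 + 6 observations lost to differencing

# Standard definitions of the three criteria
aic = 2 * k_params - 2 * log_likelihood
bic = k_params * math.log(n_effective) - 2 * log_likelihood
hqic = 2 * k_params * math.log(math.log(n_effective)) - 2 * log_likelihood

print(f"AIC  = {aic:.3f}")
print(f"BIC  = {bic:.3f}")
print(f"HQIC = {hqic:.3f}")
```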
In [57]:
plot = results.plot_diagnostics()
In [59]:
# Plot mean SARIMA predictions
fig,ax = plt.subplots(1,1,figsize=(20,8))
plt.plot(bal2, label='original')
plt.plot(forecast.predicted_mean, label='SARIMAX', c="r")
plt.xticks(bal2.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.xlabel('time')
plt.ylabel('Number of contracts')
plt.legend()
plt.grid(True)
plt.show()

Model Validation¶

In [61]:
# Plot train and test sets
plt.subplots(1,1,figsize=(20,7))
plt.plot(contract_train['count'],label='TRAIN (80%)')
plt.plot(contract_test['count'], label='TEST(20%)')
plt.legend(loc='best')
plt.xlabel('date')
plt.ylabel('count')
plt.title('Train and Test set')
plt.xticks(bal2.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.grid(True)
plt.show()

Holt-Winters¶

In [64]:
# Plot the Holt-Winters forecast of the number of contracts (Double Exponential Smoothing)
fig = plt.figure(figsize=(14,5))
plt.plot(contract_train.index, contract_train['count'], label='Train')
plt.plot(contract_test.index, contract_test['count'], label='Test')
plt.plot(des_errors_df.index, des_errors_df['Predicted_Count'], label='Forecast')
plt.legend(loc='best')
plt.xlabel('date')
plt.ylabel('count')
plt.title('Forecast using Holt Winters-Double Exponential Smoothing')
plt.xticks(bal2.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.grid(True)
plt.show()
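
The cell that produces `des_errors_df` is not shown, so the exact fitting procedure is unknown. As an illustration of what double exponential smoothing (Holt's linear trend method) computes, here is a minimal pure-Python sketch. The function name `holt_forecast`, the smoothing parameters and the toy series are all hypothetical and are not taken from the notebook.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear trend method: smooth a level and a trend, then extrapolate."""
    level = series[0]
    trend = series[1] - series[0]  # simple initial trend estimate
    for value in series[1:]:
        previous_level = level
        # Level: blend the new observation with the previous one-step forecast
        level = alpha * value + (1 - alpha) * (level + trend)
        # Trend: blend the latest level change with the previous trend
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # The h-step-ahead forecast is a straight line from the last level
    return [level + h * trend for h in range(1, horizon + 1)]


# Toy series with a perfectly linear trend (hypothetical data)
toy_series = [10.0, 12.0, 14.0, 16.0, 18.0]
toy_forecast = holt_forecast(toy_series, alpha=0.5, beta=0.5, horizon=3)
print(toy_forecast)
```

Triple exponential smoothing adds a third, seasonal component on top of the level and the trend.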
In [65]:
# Plot the Holt-Winters forecast of the number of contracts (Triple Exponential Smoothing)
fig = plt.figure(figsize=(14,5))
plt.plot(contract_train.index, contract_train['count'], label='Train')
plt.plot(contract_test.index, contract_test['count'], label='Test')
plt.plot(tes_errors_df.index, tes_errors_df['Predicted_Count'], label='Forecast')
plt.legend(loc='best')
plt.xlabel('date')
plt.ylabel('count')
plt.title('Forecast using Holt Winters-Triple Exponential Smoothing')
plt.xticks(bal2.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.grid(True)
plt.show()

Extra trees regressor¶

In [67]:
# Check the score for the train and test sets
print('Model Score at Train set: {:.2%}'.format(etr_model.score(X_train, y_train)))
print('Model Score at Test set: {:.2%}'.format(etr_model.score(X_test, y_test)))
Model Score at Train set: 100.00%
Model Score at Test set: 72.51%
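
Assuming `etr_model` is a scikit-learn regressor, its `score` method returns the coefficient of determination R², not an accuracy. A perfect score on the training set together with a clearly lower score on the test set is the usual sign of overfitting. The sketch below shows how R² is computed, using hypothetical toy values rather than the notebook's data.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]             # hypothetical targets
perfect = r2_score(y_true, y_true)         # predictions equal to the targets
mean_only = r2_score(y_true, [6.0] * 4)    # always predicting the mean

print(f"Perfect predictions: {perfect:.2%}")
print(f"Mean-only baseline:  {mean_only:.2%}")
```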
In [69]:
# Show the predictions for Extra Tree Regressor
fig = plt.figure(figsize=(14,5))
plt.plot(contract_train.index, contract_train['count'], label='Train')
plt.plot(contract_test.index, contract_test['count'], label='Test')
plt.plot(etr_errors_df.index, etr_errors_df['Predicted_Count'], label='Forecast - ExtraTreesRegressor')
plt.legend(loc='best')
plt.xlabel('Date')
plt.ylabel('count')
plt.title('Forecast using ExtraTreesRegressor model')
plt.xticks(bal2.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.grid(True)
plt.show()

Linear regression¶

In [71]:
# Show the predictions for Linear Regression
fig = plt.figure(figsize=(14,5))
plt.plot(contract_train.index, contract_train['count'], label='Train')
plt.plot(contract_test.index, contract_test['count'], label='Test')
plt.plot(lr_errors_df.index, lr_errors_df['Predicted_Count'], label='Forecast - Linear Regression')
plt.legend(loc='best')
plt.xlabel('Date')
plt.ylabel('count')
plt.title('Forecast using Linear Regression')
plt.xticks(bal2.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.grid(True)
plt.show()

ARIMA and SARIMAX¶

In [73]:
plot
Out[73]:
In [75]:
# Show the predictions for SARIMA model
fig = plt.figure(figsize=(14,5))
plt.plot(contract_train.index, contract_train['count'], label='Train')
plt.plot(contract_test.index, contract_test['count'], label='Test')
plt.plot(sarima_test_df.index, sarima_test_df['Predicted_Count'], label='Forecast - SARIMA')
plt.legend(loc='best')
plt.xlabel('Date')
plt.ylabel('count')
plt.title('Forecast using SARIMA')
plt.xticks(bal2.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.grid(True)
plt.show()
In [76]:
# Show the errors for the predicted and the actual values
plt.figure(figsize=(14,5))
plt.plot(sarima_test_df.index, np.abs(sarima_test_df['Error']), label='errors')
plt.plot(sarima_test_df.index, sarima_test_df['count'], label='Actual Count')
plt.plot(sarima_test_df.index, sarima_test_df['Predicted_Count'], label='Predicted Count')
plt.legend(loc='best')
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('Seasonal ARIMA (SARIMA) forecasts with actual count vs errors')
plt.xticks(sarima_test_df.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.show()

SVR (Support Vector Regressor) regressor¶

In [78]:
# Show predictions for Support Vector Regressor
fig = plt.figure(figsize=(14,5))
plt.plot(contract_train.index, contract_train['count'], label='Train')
plt.plot(contract_test.index, contract_test['count'], label='Test')
plt.plot(svr_errors_df.index, svr_errors_df['Predicted_Count'], label='Forecast - Support Vector Regressor')
plt.legend(loc='best')
plt.xlabel('Intervals')
plt.ylabel('Count')
plt.title('Forecast using Support Vector Regressor')
plt.xticks(bal2.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.grid(True)
plt.show()

Compare the different models¶

In [81]:
metrics_table
Out[81]:
Modelname                 Total_Count  Total_Pred_Count  Model_Overall_Error          MAE         RMSE       MAPE
Support Vector Regressor       554100     591339.950117         37239.950117  1549.257795  2479.714193   9.785963
Linear Regression              554100     556990.797802          2890.797802  1569.672280  1956.473486   9.914912
Holtman- TES                   554100     671226.045459        117126.045459  4614.475776  5460.132469  29.147564
SARIMA                         554100     677672.762120       -123572.762120  4182.641275  4993.732693  26.419860
ExtreeTreesRegressor           554100     558054.540000          3954.540000  1406.727429  1881.467556   8.885663
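
The cell that builds `metrics_table` is not shown. The sketch below illustrates how the error columns are conventionally defined, using a hypothetical helper `forecast_metrics` and toy values. The sign convention for the overall error (predicted minus actual) is an assumption: it matches most rows of the table, although the SARIMA row appears to use the opposite sign.

```python
import math


def forecast_metrics(actual, predicted):
    """Return overall error, MAE, RMSE and MAPE (in percent) for a forecast."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    return {
        "overall_error": sum(errors),  # total predicted minus total actual
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAPE": 100 * sum(abs(e) / abs(a) for a, e in zip(actual, errors)) / n,
    }


# Hypothetical toy values, not the notebook's data
toy_actual = [100, 200, 400]
toy_predicted = [110, 180, 400]
toy_metrics = forecast_metrics(toy_actual, toy_predicted)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

A small overall error can hide large individual errors that cancel out, which is why MAE, RMSE and MAPE are more informative when comparing the models.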
In [82]:
# Check the model scores on the train and test sets
print('Extra Tree Regressor')
print('Model Score at Train set: {:.2%}'.format(etr_model.score(X_train, y_train))) 
print('Model Score at Test set: {:.2%}'.format(etr_model.score(X_test, y_test))) 
Extra Tree Regressor
Model Score at Train set: 100.00%
Model Score at Test set: 72.51%

Pre COVID Prediction¶

Decision Tree¶

In [ ]:
plt.figure(figsize=(4,4), dpi=1000)
plot_tree(dct,
          feature_names=["ETA", "ANNO", "contratto_transformed", "nazionalita_transformed", "genere_transformed"],
          filled=True)

# Show Decision tree
plt.show()
In [ ]:
# Predict the class probabilities
y_pred_prob = dct.predict_proba(X_test)
y_pred_prob
In [ ]:
# Predict the values
y_pred = dct.predict(X_test)
y_pred
In [ ]:
# Print the confusion matrix
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
In [ ]:
# Show the confusion matrix
fig=px.imshow(normalized,color_continuous_scale='blues')
fig.show()
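
The variable `normalized` is not defined in the cells shown. One common choice, assumed here, is to divide each row of the confusion matrix by its row total, so that every row shows the share of a true class assigned to each predicted class. A minimal sketch with a hypothetical 2x2 matrix:

```python
def normalize_rows(matrix):
    """Divide each row by its sum; rows that sum to zero are left as zeros."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([value / total if total else 0.0 for value in row])
    return result


toy_confusion = [[8, 2],
                 [1, 9]]  # hypothetical counts: rows = true class, columns = predicted
toy_normalized = normalize_rows(toy_confusion)
print(toy_normalized)
```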

ARIMA¶

In [85]:
f, ax1 = plt.subplots(1,1,figsize=(15,5))
balanced2.plot(ax=ax1)
ax1.set_xlabel("time")
ax1.set_ylabel("Number of Contracts")
Out[85]:
Text(0, 0.5, 'Number of Contracts')
In [86]:
# Run the Augmented Dickey-Fuller test, which checks for a unit root in a univariate
# process in the presence of serial correlation.
results = adfuller(balanced2['count'])
print('ADF Statistic: %f' % results[0])
print('p-value: %f' % results[1])
ADF Statistic: -3.504387
p-value: 0.007875
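
The null hypothesis of the Augmented Dickey-Fuller test is that the series has a unit root, i.e. that it is non-stationary. A short sketch of how the printed result can be read, assuming the conventional 5% significance level (the threshold is an assumption, not something stated in the notebook):

```python
adf_statistic = -3.504387  # value printed above
p_value = 0.007875         # value printed above
alpha = 0.05               # assumed significance level

# A p-value below alpha rejects the unit-root null hypothesis
reject_unit_root = p_value < alpha

if reject_unit_root:
    print("Unit root rejected: the series looks stationary at the 5% level.")
else:
    print("Unit root not rejected: the series may need differencing.")
```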
In [88]:
fig, ax = plt.subplots(4, 1, figsize=(15, 6))
# Plot the series
decomposed_add.observed.plot(ax = ax[0])
decomposed_add.trend.plot(ax = ax[1])
decomposed_add.seasonal.plot(ax = ax[2])
decomposed_add.resid.plot(ax = ax[3])
# Add the labels to the Y-axis
ax[0].set_ylabel('')
ax[1].set_ylabel('Trend')
ax[2].set_ylabel('Seasonal')
ax[3].set_ylabel('Residual')

plt.tight_layout()
plt.show()
In [89]:
arima_df = arima_aic(balanced2)
arima_df
Out[89]:
    p  q          aic          bic  sum_aic_bic
 0  0  0  2502.001714  2507.827024  5009.828738
 4  1  0  2491.176986  2499.914951  4991.091937
 8  2  0  2479.701064  2491.351683  4971.052747
10  2  2  2478.662819  2496.138748  4974.801567
 6  1  2  2477.720877  2492.284152  4970.005029
 3  0  3  2477.487104  2492.050379  4969.537483
 1  0  1  2476.980355  2485.718319  4962.698674
 5  1  1  2476.754321   2488.40494  4965.159261
 9  2  1   2476.66713  2491.230405  4967.897535
 2  0  2  2475.886333  2487.536952  4963.423285
12  3  0  2475.608307  2490.171582  4965.779889
13  3  1  2474.972059  2492.447988  4967.420047
In [91]:
plot = results.plot_diagnostics()
In [93]:
# Plot mean SARIMA predictions
fig,ax = plt.subplots(1,1,figsize=(20,8))
plt.plot(balanced2, label='original')
plt.plot(forecast.predicted_mean, label='SARIMAX', c="r")
plt.xticks(balanced2.index.unique())
plt.locator_params(axis='x', nbins=10)
plt.xlabel('time')
plt.ylabel('Number of contracts')
plt.title('Pre-covid Situation')
plt.legend()
plt.grid(True)
plt.show()

Pre vs Post COVID¶

In [95]:
f, ax1 = plt.subplots(1,1,figsize=(15,5))
balanced_covid1.plot(ax=ax1)
ax1.set_xlabel("time")
ax1.set_ylabel("Number of Contracts")
Out[95]:
Text(0, 0.5, 'Number of Contracts')
In [97]:
# Plot mean SARIMA predictions
fig,ax = plt.subplots(1,1,figsize=(20,8))
plt.plot(balanced2, label='Real values before COVID')
plt.plot(forecast.predicted_mean, label='SARIMAX prediction', c="r")
plt.plot(balanced_covid1, label='Real values after COVID', c='g')
plt.xticks(balanced2.index.unique())

plt.locator_params(axis='x', nbins=10)
plt.xlabel('time')
plt.ylabel('Number of contracts')
plt.title('COVID Situation (Checking)')
plt.legend()
plt.grid(True)
plt.show()

Results¶